Source Code: https://github.com/djlofland/DATA622_S2021_Group2/tree/master/Homework4
For this assignment, we will be working with a very interesting mental health dataset from a real-life research project. All identifying information, of course, has been removed. The attached spreadsheet has the data (the tab name “Data”). The data dictionary is given in the second tab. You can get as creative as you want. The assignment is designed to really get you to think about how you could use different methods.
We start by loading the Excel dataset provided into a dataframe.
# Show df summary deatils
skim(df)| Name | df |
| Number of rows | 175 |
| Number of columns | 54 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 53 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Initial | 0 | 1 | 2 | 3 | 0 | 108 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1.00 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1.00 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1.00 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD Q1 | 0 | 1.00 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▇▇▇▆▃ |
| ADHD Q2 | 0 | 1.00 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q3 | 0 | 1.00 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q4 | 0 | 1.00 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▅▆ |
| ADHD Q5 | 0 | 1.00 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | ▇▅▇▆▁ |
| ADHD Q6 | 0 | 1.00 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▅▇▇▃ |
| ADHD Q7 | 0 | 1.00 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▃▃ |
| ADHD Q8 | 0 | 1.00 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▇▆ |
| ADHD Q9 | 0 | 1.00 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▇▇▇▅ |
| ADHD Q10 | 0 | 1.00 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▇▇▆▅ |
| ADHD Q11 | 0 | 1.00 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▆▇▇▆ |
| ADHD Q12 | 0 | 1.00 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▂▂ |
| ADHD Q13 | 0 | 1.00 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | ▂▅▇▇▆ |
| ADHD Q14 | 0 | 1.00 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▇▆ |
| ADHD Q15 | 0 | 1.00 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | ▇▆▆▅▃ |
| ADHD Q16 | 0 | 1.00 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | ▆▇▆▃▅ |
| ADHD Q17 | 0 | 1.00 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▇▃▃ |
| ADHD Q18 | 0 | 1.00 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▃▃ |
| ADHD Total | 0 | 1.00 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD Q1a | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1b | 0 | 1.00 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1c | 0 | 1.00 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1d | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1e | 0 | 1.00 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1f | 0 | 1.00 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1g | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1h | 0 | 1.00 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1i | 0 | 1.00 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1j | 0 | 1.00 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▅ |
| MD Q1k | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1L | 0 | 1.00 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1m | 0 | 1.00 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q2 | 0 | 1.00 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q3 | 0 | 1.00 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | ▂▂▁▅▇ |
| MD TOTAL | 0 | 1.00 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
| Alcohol | 4 | 0.98 | 1.35 | 1.39 | 0 | 0.0 | 1 | 3.0 | 3 | ▇▂▁▁▆ |
| THC | 4 | 0.98 | 0.81 | 1.27 | 0 | 0.0 | 0 | 1.5 | 3 | ▇▁▁▁▃ |
| Cocaine | 4 | 0.98 | 1.09 | 1.39 | 0 | 0.0 | 0 | 3.0 | 3 | ▇▁▁▁▅ |
| Stimulants | 4 | 0.98 | 0.12 | 0.53 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Sedative-hypnotics | 4 | 0.98 | 0.12 | 0.54 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Opioids | 4 | 0.98 | 0.39 | 0.99 | 0 | 0.0 | 0 | 0.0 | 3 | ▇▁▁▁▁ |
| Court order | 5 | 0.97 | 0.09 | 0.28 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▁ |
| Education | 9 | 0.95 | 11.90 | 2.17 | 6 | 11.0 | 12 | 13.0 | 19 | ▁▅▇▂▁ |
| Hx of Violence | 11 | 0.94 | 0.24 | 0.43 | 0 | 0.0 | 0 | 0.0 | 1 | ▇▁▁▁▂ |
| Disorderly Conduct | 11 | 0.94 | 0.73 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| Suicide | 13 | 0.93 | 0.30 | 0.46 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▃ |
| Abuse | 14 | 0.92 | 1.33 | 2.12 | 0 | 0.0 | 0 | 2.0 | 7 | ▇▂▁▁▁ |
| Non-subst Dx | 22 | 0.87 | 0.44 | 0.68 | 0 | 0.0 | 0 | 1.0 | 2 | ▇▁▃▁▁ |
| Subst Dx | 23 | 0.87 | 1.14 | 0.93 | 0 | 0.0 | 1 | 2.0 | 3 | ▆▇▁▅▂ |
| Psych meds. | 118 | 0.33 | 0.96 | 0.80 | 0 | 0.0 | 1 | 2.0 | 2 | ▇▁▇▁▆ |
# Display first few rows for inspection
kable(head(df)) %>%
kable_styling(bootstrap_options = "basic")| Initial | Age | Sex | Race | ADHD Q1 | ADHD Q2 | ADHD Q3 | ADHD Q4 | ADHD Q5 | ADHD Q6 | ADHD Q7 | ADHD Q8 | ADHD Q9 | ADHD Q10 | ADHD Q11 | ADHD Q12 | ADHD Q13 | ADHD Q14 | ADHD Q15 | ADHD Q16 | ADHD Q17 | ADHD Q18 | ADHD Total | MD Q1a | MD Q1b | MD Q1c | MD Q1d | MD Q1e | MD Q1f | MD Q1g | MD Q1h | MD Q1i | MD Q1j | MD Q1k | MD Q1L | MD Q1m | MD Q2 | MD Q3 | MD TOTAL | Alcohol | THC | Cocaine | Stimulants | Sedative-hypnotics | Opioids | Court order | Education | Hx of Violence | Disorderly Conduct | Suicide | Abuse | Non-subst Dx | Subst Dx | Psych meds. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JA | 24 | 1 | 1 | 1 | 1 | 4 | 2 | 3 | 1 | 1 | 3 | 2 | 4 | 2 | 4 | 1 | 0 | 3 | 1 | 3 | 4 | 40 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 3 | 15 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 11 | 0 | 1 | 1 | 0 | 2 | 0 | 2 |
| LA | 48 | 2 | 1 | 3 | 3 | 4 | 4 | 5 | 2 | 2 | 3 | 2 | 4 | 1 | 4 | 2 | 4 | 4 | 3 | 1 | 4 | 55 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 1 | 4 | 1 | 0 | 1 |
| MD | 51 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 3 | 3 | 2 | 0 | 1 | 2 | 0 | 2 | 2 | 3 | 2 | 1 | 1 | 31 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 6 | 2 | 0 | 1 |
| RD | 43 | 1 | 1 | 3 | 3 | 2 | 2 | 4 | 3 | 2 | 4 | 4 | 2 | 3 | 1 | 3 | 3 | 1 | 2 | 1 | 2 | 45 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 3 | 13 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 12 | 0 | 0 | 1 | 7 | 2 | 0 | 2 |
| RB | 34 | 1 | 1 | 4 | 4 | 2 | 4 | 4 | 2 | 3 | 4 | 4 | 2 | 4 | 1 | 3 | 2 | 1 | 2 | 1 | 1 | 48 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 7 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 9 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |
| SB | 39 | 2 | 1 | 2 | 3 | 1 | 4 | 3 | 2 | 3 | 4 | 4 | 2 | 4 | 2 | 4 | 4 | 3 | 4 | 3 | 3 | 55 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 3 | 14 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 0 | 1 | 1 | 2 | 0 | 0 | 0 |
[TODO: Add any commentary about data that we see … missing, patterns, # rows, # cols, etc]
As we saw in our data summary, we have some missing values. Lets explore this a little further to see if any important patterns stand out.
# Use nanair package to plot missing value patterns
gg_miss_upset(df) The missing data only appears in a few of our features:
Suicide, Abuse, Non-subst Dx, Subst Dx, and Psych meds..
For factors, doing traditional imputing approaches can be problematic. While in most cases, missing can might be inferred as 0, we don’t know if there were problems with administering, coding the survey, and/or patients didn’t want to answer and skipped questions (when some of these may have applied). So, rather than dropping the rows or imputing, we will instead create new factor values ‘NA’ for each indicating that we lack that information.
We also see that for each respondent, we have their initial. This won’t be useful for our purposes (note there are also duplicates where multiple people had the same initials).
# Replace missing with 'unknown' so we have a factor value for each
df <- df %>%
na_replace('unknown')
# Drop Initial column
df <- df %>%
dplyr::select(-Initial)Looking at our dataset, we see that missing values are no longer an issue; however, we still have some additional cleanup to perform on our dataset below before we proceed with kNN modeling.
# Display data from summary
skim(df)| Name | df |
| Number of rows | 175 |
| Number of columns | 53 |
| _______________________ | |
| Column type frequency: | |
| character | 15 |
| numeric | 38 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Alcohol | 0 | 1 | 1 | 7 | 0 | 5 | 0 |
| THC | 0 | 1 | 1 | 7 | 0 | 5 | 0 |
| Cocaine | 0 | 1 | 1 | 7 | 0 | 5 | 0 |
| Stimulants | 0 | 1 | 1 | 7 | 0 | 4 | 0 |
| Sedative-hypnotics | 0 | 1 | 1 | 7 | 0 | 5 | 0 |
| Opioids | 0 | 1 | 1 | 7 | 0 | 4 | 0 |
| Court order | 0 | 1 | 1 | 7 | 0 | 3 | 0 |
| Education | 0 | 1 | 1 | 7 | 0 | 15 | 0 |
| Hx of Violence | 0 | 1 | 1 | 7 | 0 | 3 | 0 |
| Disorderly Conduct | 0 | 1 | 1 | 7 | 0 | 3 | 0 |
| Suicide | 0 | 1 | 1 | 7 | 0 | 3 | 0 |
| Abuse | 0 | 1 | 1 | 7 | 0 | 9 | 0 |
| Non-subst Dx | 0 | 1 | 1 | 7 | 0 | 4 | 0 |
| Subst Dx | 0 | 1 | 1 | 7 | 0 | 5 | 0 |
| Psych meds. | 0 | 1 | 1 | 7 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 39.47 | 11.17 | 18 | 29.5 | 42 | 48.0 | 69 | ▆▅▇▅▁ |
| Sex | 0 | 1 | 1.43 | 0.50 | 1 | 1.0 | 1 | 2.0 | 2 | ▇▁▁▁▆ |
| Race | 0 | 1 | 1.64 | 0.69 | 1 | 1.0 | 2 | 2.0 | 6 | ▇▁▁▁▁ |
| ADHD Q1 | 0 | 1 | 1.70 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▇▇▇▆▃ |
| ADHD Q2 | 0 | 1 | 1.91 | 1.25 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q3 | 0 | 1 | 1.91 | 1.27 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▇▇▆▅ |
| ADHD Q4 | 0 | 1 | 2.10 | 1.34 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▅▆ |
| ADHD Q5 | 0 | 1 | 2.26 | 1.44 | 0 | 1.0 | 3 | 3.0 | 5 | ▇▅▇▆▁ |
| ADHD Q6 | 0 | 1 | 1.91 | 1.31 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▅▇▇▃ |
| ADHD Q7 | 0 | 1 | 1.83 | 1.19 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▃▃ |
| ADHD Q8 | 0 | 1 | 2.14 | 1.29 | 0 | 1.0 | 2 | 3.0 | 4 | ▃▇▇▇▆ |
| ADHD Q9 | 0 | 1 | 1.91 | 1.32 | 0 | 1.0 | 2 | 3.0 | 4 | ▆▇▇▇▅ |
| ADHD Q10 | 0 | 1 | 2.12 | 1.23 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▇▇▆▅ |
| ADHD Q11 | 0 | 1 | 2.27 | 1.24 | 0 | 1.0 | 2 | 3.0 | 4 | ▂▆▇▇▆ |
| ADHD Q12 | 0 | 1 | 1.29 | 1.21 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▂▂ |
| ADHD Q13 | 0 | 1 | 2.37 | 1.23 | 0 | 1.5 | 2 | 3.0 | 4 | ▂▅▇▇▆ |
| ADHD Q14 | 0 | 1 | 2.25 | 1.35 | 0 | 1.0 | 2 | 3.0 | 4 | ▅▅▇▇▆ |
| ADHD Q15 | 0 | 1 | 1.63 | 1.39 | 0 | 0.0 | 1 | 3.0 | 4 | ▇▆▆▅▃ |
| ADHD Q16 | 0 | 1 | 1.70 | 1.38 | 0 | 1.0 | 1 | 3.0 | 4 | ▆▇▆▃▅ |
| ADHD Q17 | 0 | 1 | 1.53 | 1.29 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▇▃▃ |
| ADHD Q18 | 0 | 1 | 1.47 | 1.30 | 0 | 0.0 | 1 | 2.0 | 4 | ▇▇▆▃▃ |
| ADHD Total | 0 | 1 | 34.32 | 16.70 | 0 | 21.0 | 33 | 47.5 | 72 | ▃▆▇▆▂ |
| MD Q1a | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1b | 0 | 1 | 0.57 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1c | 0 | 1 | 0.54 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1d | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1e | 0 | 1 | 0.55 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1f | 0 | 1 | 0.70 | 0.46 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1g | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q1h | 0 | 1 | 0.56 | 0.50 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1i | 0 | 1 | 0.59 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1j | 0 | 1 | 0.39 | 0.49 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▅ |
| MD Q1k | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q1L | 0 | 1 | 0.58 | 0.49 | 0 | 0.0 | 1 | 1.0 | 1 | ▆▁▁▁▇ |
| MD Q1m | 0 | 1 | 0.49 | 0.50 | 0 | 0.0 | 0 | 1.0 | 1 | ▇▁▁▁▇ |
| MD Q2 | 0 | 1 | 0.72 | 0.45 | 0 | 0.0 | 1 | 1.0 | 1 | ▃▁▁▁▇ |
| MD Q3 | 0 | 1 | 2.01 | 1.07 | 0 | 1.0 | 2 | 3.0 | 3 | ▂▂▁▅▇ |
| MD TOTAL | 0 | 1 | 10.02 | 4.81 | 0 | 6.5 | 11 | 14.0 | 17 | ▃▃▆▇▇ |
We know that most of our features are either factors or ordinal factors. Let’s recode from float to factor to ensure our models don’t incorrectly treat them as continuous. Note that we’ll leave Age as continuous even though it’s technically an ordinal factor. Also, for simplicity, boolean features (only 0 or 1) can be left as interger. Looking ahead, factors are going to lead to a large number of features when we dummy code in preparation for modeling. Dummy encoding leads to separate columns for each value. High dimensions can lead to problems [TODO: insert discussion as to why here]. Fortunately, we don’t have that many features and values, so we aren’t starting with an excessive number of coumns. That said, we will likely need to either reduce dimensions thorough PCA or different recoding approaches.
# We'll leave age as an integer
df$Age <- as.integer(df$Age)
# Misc columns (note: some columns are ordinal, others are not)
df$Sex <- factor(df$Sex)
df$Race <- factor(df$Race)
df$Alcohol <- factor(df$Alcohol, levels=c("unknown", 0, 1, 2, 3))
df$THC <- factor(df$THC, levels=c('unknown', 0, 1, 2, 3))
df$Cocaine <- factor(df$Cocaine, levels=c('unknown', 0, 1, 2, 3))
df$Stimulants <- factor(df$Stimulants, levels=c('unknown', 0, 1, 2, 3))
df$`Sedative-hypnotics` <- factor(df$`Sedative-hypnotics`, levels=c('unknown', 0, 1, 2, 3))
df$Opioids <- factor(df$Opioids, levels=c('unknown', 0, 1, 2, 3))
df$Education <- factor(df$Education, levels=c('unknown', 0, 1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20))
df$`Court order` <- factor(df$`Court order`, levels=c('unknown', 0, 1))
df$`Hx of Violence` <- factor(df$`Hx of Violence`, levels=c('unknown', 0, 1))
df$`Disorderly Conduct` <- factor(df$`Disorderly Conduct`, levels=c('unknown', 0, 1))
df$Suicide <- factor(df$Suicide, levels=c('unknown', 0, 1))
df$Abuse <- factor(df$Abuse, levels=c('unknown', 0, 1, 2, 3, 4,5, 6, 7))
df$`Non-subst Dx` <- factor(df$`Non-subst Dx`, levels=c('unknown', 0, 1, 2))
df$`Subst Dx` <- factor(df$`Subst Dx`, levels = c('unknown', 0, 1, 2, 3))
df$`Psych meds.` <- factor(df$`Psych meds.`, levels=c('unknown', 0, 1, 2))
# Quick trick to fix all columns matching name patter
df <- df %>%
mutate(across(contains('ADHD'), factor)) %>%
mutate(across(contains('MD Q'), factor))
# Back these out since they are totals (sum), not factor
df$`ADHD Total` <- as.integer(df$`ADHD Total`)
df$`MD TOTAL` <- as.integer(df$`MD TOTAL`)For the continuous variables, we can examine the distributions, broken out by the target variable, Sex.
[TODO: This was quick and dirty placeholder … Provide better visualizations, possibly by feature category (e.e.g, ADHD, MD, Substance Abuce columns, etc)]
[TODO: Insert CorrPlot? Since he have high dimensions, it may be problematic to visualize. Maybe if we facet? not sure]
Placeholder (and useless) image inserted below.
library(ggcorrplot)
model.matrix(~., data=df) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)[TODO: Describe any clear feature distributions or correlations patterns]
We’ll need to dummy code our categorical variables. This process will create new columns for each value and assign a 0 or 1. Note that dummy encoding typically drops one value which becomes the baseline. So if we have a categorical feature with five unique values, we will have four columns. If all columns are 0, that represents the reference value. This helps reduce the number of necessary columns. With dummy columns in place, we need to remove our original variables from this dataset.
# dummy encode our categorical features
df_dummy <- dummyVars(~ 0 + ., drop2nd=TRUE, data = df)
df_dummy <- data.frame(predict(df_dummy, newdata = df))
# Uncomment if we want to verify the dummy df summary
#skim(df_dummy)We should remove any columns with zero variance. Note, since we are working with a sparse dataset, we do NOT want to remove near zero variance, jsut zer variance.
nzv <- nearZeroVar(df_dummy, saveMetrics= TRUE)
(nzv %>% filter(zeroVar == T))## freqRatio percentUnique zeroVar nzv
## Stimulants.2 0 0.5714286 TRUE TRUE
## Opioids.2 0 0.5714286 TRUE TRUE
## Education.0 0 0.5714286 TRUE TRUE
## Education.1 0 0.5714286 TRUE TRUE
## Education.2 0 0.5714286 TRUE TRUE
## Education.3 0 0.5714286 TRUE TRUE
## Education.4 0 0.5714286 TRUE TRUE
## Education.5 0 0.5714286 TRUE TRUE
## Education.20 0 0.5714286 TRUE TRUE
# drop columns with zero variance
nzv <- nearZeroVar(df_dummy)
df_dummy <- df_dummy[, -nzv][TODO: Should we revisit correlations between factor variable? Describe any clear feature distributions or correlations patterns]
[TODO: Show feature value correlations]
[TODO: Discuss multicolineatity and determine if any features are candidates to be dropped up front.]
If values within factors are highly imbalanced, then we can get bias where specific values are better repreented an thus algorithms have more datapoints to learn from.
[TODO: If any strong class imbalance of values with in a feature, we might need to bootstrap or resample to address]
Next, we’ll do some data tidying to get our data set ready for our KNN model. [TODO: Insert discussion on any features to remove, keep, data type changes, needs transforms, etc]
# remove any features
#df <- df %>%
# dplyr::select(-c())Age is our only continuous variable. We normalize our continuous features using a simple z-score standardization. Although this may help our KNN model, we’ll have to be careful about interpretability later on!
# z-score scale continuous features
df[, c("Age")] <- scale(df[, c("Age")])Our data set is ready for our unsupervised models.
# quick inspect of dataframe
kable(head(df_dummy)) %>%
kable_styling(bootstrap_options = "basic")| Age | Sex.1 | Sex.2 | Race.1 | Race.2 | X.ADHD.Q1.0 | X.ADHD.Q1.1 | X.ADHD.Q1.2 | X.ADHD.Q1.3 | X.ADHD.Q1.4 | X.ADHD.Q2.0 | X.ADHD.Q2.1 | X.ADHD.Q2.2 | X.ADHD.Q2.3 | X.ADHD.Q2.4 | X.ADHD.Q3.0 | X.ADHD.Q3.1 | X.ADHD.Q3.2 | X.ADHD.Q3.3 | X.ADHD.Q3.4 | X.ADHD.Q4.0 | X.ADHD.Q4.1 | X.ADHD.Q4.2 | X.ADHD.Q4.3 | X.ADHD.Q4.4 | X.ADHD.Q5.0 | X.ADHD.Q5.1 | X.ADHD.Q5.2 | X.ADHD.Q5.3 | X.ADHD.Q5.4 | X.ADHD.Q6.0 | X.ADHD.Q6.1 | X.ADHD.Q6.2 | X.ADHD.Q6.3 | X.ADHD.Q6.4 | X.ADHD.Q7.0 | X.ADHD.Q7.1 | X.ADHD.Q7.2 | X.ADHD.Q7.3 | X.ADHD.Q7.4 | X.ADHD.Q8.0 | X.ADHD.Q8.1 | X.ADHD.Q8.2 | X.ADHD.Q8.3 | X.ADHD.Q8.4 | X.ADHD.Q9.0 | X.ADHD.Q9.1 | X.ADHD.Q9.2 | X.ADHD.Q9.3 | X.ADHD.Q9.4 | X.ADHD.Q10.0 | X.ADHD.Q10.1 | X.ADHD.Q10.2 | X.ADHD.Q10.3 | X.ADHD.Q10.4 | X.ADHD.Q11.0 | X.ADHD.Q11.1 | X.ADHD.Q11.2 | X.ADHD.Q11.3 | X.ADHD.Q11.4 | X.ADHD.Q12.0 | X.ADHD.Q12.1 | X.ADHD.Q12.2 | X.ADHD.Q12.3 | X.ADHD.Q12.4 | X.ADHD.Q13.0 | X.ADHD.Q13.1 | X.ADHD.Q13.2 | X.ADHD.Q13.3 | X.ADHD.Q13.4 | X.ADHD.Q14.0 | X.ADHD.Q14.1 | X.ADHD.Q14.2 | X.ADHD.Q14.3 | X.ADHD.Q14.4 | X.ADHD.Q15.0 | X.ADHD.Q15.1 | X.ADHD.Q15.2 | X.ADHD.Q15.3 | X.ADHD.Q15.4 | X.ADHD.Q16.0 | X.ADHD.Q16.1 | X.ADHD.Q16.2 | X.ADHD.Q16.3 | X.ADHD.Q16.4 | X.ADHD.Q17.0 | X.ADHD.Q17.1 | X.ADHD.Q17.2 | X.ADHD.Q17.3 | X.ADHD.Q17.4 | X.ADHD.Q18.0 | X.ADHD.Q18.1 | X.ADHD.Q18.2 | X.ADHD.Q18.3 | X.ADHD.Q18.4 | X.ADHD.Total. | X.MD.Q1a.0 | X.MD.Q1a.1 | X.MD.Q1b.0 | X.MD.Q1b.1 | X.MD.Q1c.0 | X.MD.Q1c.1 | X.MD.Q1d.0 | X.MD.Q1d.1 | X.MD.Q1e.0 | X.MD.Q1e.1 | X.MD.Q1f.0 | X.MD.Q1f.1 | X.MD.Q1g.0 | X.MD.Q1g.1 | X.MD.Q1h.0 | X.MD.Q1h.1 | X.MD.Q1i.0 | X.MD.Q1i.1 | X.MD.Q1j.0 | X.MD.Q1j.1 | X.MD.Q1k.0 | X.MD.Q1k.1 | X.MD.Q1L.0 | X.MD.Q1L.1 | X.MD.Q1m.0 | X.MD.Q1m.1 | X.MD.Q2.0 | X.MD.Q2.1 | X.MD.Q3.0 | X.MD.Q3.1 | X.MD.Q3.2 | X.MD.Q3.3 | X.MD.TOTAL. | Alcohol.0 | Alcohol.1 | Alcohol.3 | THC.0 | THC.1 | THC.3 | Cocaine.0 | Cocaine.1 | Cocaine.3 | Stimulants.0 | X.Sedative.hypnotics.0 | Opioids.0 | Opioids.3 | X.Court.order.0 | X.Court.order.1 | Education.unknown | Education.9 | Education.10 | Education.11 | Education.12 | Education.13 | Education.14 | X.Hx.of.Violence.unknown | X.Hx.of.Violence.0 | X.Hx.of.Violence.1 | X.Disorderly.Conduct.unknown | X.Disorderly.Conduct.0 | X.Disorderly.Conduct.1 | Suicide.unknown | Suicide.0 | Suicide.1 | Abuse.unknown | Abuse.0 | Abuse.2 | Abuse.5 | X.Non.subst.Dx.unknown | X.Non.subst.Dx.0 | X.Non.subst.Dx.1 | X.Non.subst.Dx.2 | X.Subst.Dx.unknown | X.Subst.Dx.0 | X.Subst.Dx.1 | X.Subst.Dx.2 | X.Subst.Dx.3 | X.Psych.meds..unknown | X.Psych.meds..0 | X.Psych.meds..1 | X.Psych.meds..2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 37 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 15 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 48 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 52 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 14 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 51 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 28 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 5 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 43 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 42 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 13 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 34 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 45 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 7 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 39 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 52 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 14 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Please use a clustering method to find clusters of patients here. Whether you choose to use k-means clustering or hierarchical clustering is up to you as long as you reason through your work. Can you come up with creative names for the profiles you found? (60)
[TODO: Describe what K-Means is and what we will do]
We’ll need to identify the appropriate number for k. k represents the number of clusters we will group rows into. Three common techniques are: Elbow, Silhouette, Gap statistic, and NBClust(). For simplicity, we will uyse the Elbow approach. [TODO: Insert disscussion]
# Elbow method
fviz_nbclust(df_dummy, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) + # add line for better visualization
labs(subtitle = "Elbow method") # add subtitleWe can see above, that it is approximately 5. More clusters do not improve within segment variability. Therefore, we’ll perform our initial K-Means model with \(k=5\).
set.seed(42)
km_res <- kmeans(df_dummy, centers = 5, nstart = 20)
summary(km_res)## Length Class Mode
## cluster 175 -none- numeric
## centers 885 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
sil <- silhouette(km_res$cluster, dist(df_dummy))
fviz_silhouette(sil)## cluster size ave.sil.width
## 1 1 34 0.33
## 2 2 18 0.33
## 3 3 43 0.29
## 4 4 39 0.33
## 5 5 41 0.24
fviz_cluster(km_res, df_dummy, ellipse.type = "norm")[TODO: Discussion of model results]
[TODO: Describe what HC is and what we will do]
[TODO: Discussion of model results]
[EDA of Segments - use variable importance to see which features (factor and value) were most important, then see if there are clear groupings]
[TODO: Overall Discussion of segments]
Let’s explore using Principal Component Analysis on this dataset. You will note that there are different types of questions in the dataset: columns E-W: ADHD self-report; column X–AM: mood disorders questionnaire, column AN-AS: Individual Substance Misuse; etc. Please reason through your work as you decide on which sets of variables you want to use to conduct Principal Component Analysis. (60)
df_dummy.pca <- prcomp(df_dummy, center = TRUE,scale. = TRUE)
summary(df_dummy.pca)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.4894 3.20295 2.77590 2.62572 2.30642 2.10226 2.02049
## Proportion of Variance 0.1139 0.05796 0.04353 0.03895 0.03005 0.02497 0.02306
## Cumulative Proportion 0.1139 0.17183 0.21537 0.25432 0.28437 0.30934 0.33240
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.98530 1.93805 1.89249 1.83511 1.76704 1.75229 1.71581
## Proportion of Variance 0.02227 0.02122 0.02023 0.01903 0.01764 0.01735 0.01663
## Cumulative Proportion 0.35467 0.37589 0.39613 0.41515 0.43279 0.45014 0.46677
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 1.69830 1.67791 1.64408 1.60241 1.57614 1.53845 1.5343
## Proportion of Variance 0.01629 0.01591 0.01527 0.01451 0.01404 0.01337 0.0133
## Cumulative Proportion 0.48307 0.49898 0.51425 0.52875 0.54279 0.55616 0.5695
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 1.51049 1.48294 1.47207 1.44851 1.42889 1.41373 1.3888
## Proportion of Variance 0.01289 0.01242 0.01224 0.01185 0.01154 0.01129 0.0109
## Cumulative Proportion 0.58235 0.59477 0.60702 0.61887 0.63041 0.64170 0.6526
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 1.3695 1.34982 1.32913 1.31066 1.30210 1.27977 1.27124
## Proportion of Variance 0.0106 0.01029 0.00998 0.00971 0.00958 0.00925 0.00913
## Cumulative Proportion 0.6632 0.67349 0.68347 0.69317 0.70275 0.71200 0.72113
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 1.26014 1.23169 1.21476 1.20622 1.19689 1.17444 1.15464
## Proportion of Variance 0.00897 0.00857 0.00834 0.00822 0.00809 0.00779 0.00753
## Cumulative Proportion 0.73011 0.73868 0.74701 0.75523 0.76333 0.77112 0.77865
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 1.13430 1.11718 1.09868 1.08651 1.07766 1.06357 1.04314
## Proportion of Variance 0.00727 0.00705 0.00682 0.00667 0.00656 0.00639 0.00615
## Cumulative Proportion 0.78592 0.79297 0.79979 0.80646 0.81302 0.81941 0.82556
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 1.02718 1.01907 1.00972 0.99954 0.97391 0.95617 0.94644
## Proportion of Variance 0.00596 0.00587 0.00576 0.00564 0.00536 0.00517 0.00506
## Cumulative Proportion 0.83152 0.83739 0.84315 0.84879 0.85415 0.85932 0.86438
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.92962 0.91885 0.90686 0.88778 0.88067 0.87013 0.86527
## Proportion of Variance 0.00488 0.00477 0.00465 0.00445 0.00438 0.00428 0.00423
## Cumulative Proportion 0.86926 0.87403 0.87868 0.88313 0.88751 0.89179 0.89602
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.85583 0.84744 0.84228 0.82681 0.81179 0.7869 0.78045
## Proportion of Variance 0.00414 0.00406 0.00401 0.00386 0.00372 0.0035 0.00344
## Cumulative Proportion 0.90016 0.90422 0.90822 0.91209 0.91581 0.9193 0.92275
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.75570 0.74874 0.74247 0.72263 0.71918 0.70869 0.69445
## Proportion of Variance 0.00323 0.00317 0.00311 0.00295 0.00292 0.00284 0.00272
## Cumulative Proportion 0.92598 0.92914 0.93226 0.93521 0.93813 0.94097 0.94369
## PC78 PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 0.6914 0.68394 0.67353 0.65629 0.6379 0.6246 0.61415
## Proportion of Variance 0.0027 0.00264 0.00256 0.00243 0.0023 0.0022 0.00213
## Cumulative Proportion 0.9464 0.94904 0.95160 0.95403 0.9563 0.9585 0.96067
## PC85 PC86 PC87 PC88 PC89 PC90 PC91
## Standard deviation 0.61093 0.59934 0.59026 0.5799 0.56819 0.55491 0.54633
## Proportion of Variance 0.00211 0.00203 0.00197 0.0019 0.00182 0.00174 0.00169
## Cumulative Proportion 0.96278 0.96480 0.96677 0.9687 0.97050 0.97224 0.97392
## PC92 PC93 PC94 PC95 PC96 PC97 PC98
## Standard deviation 0.53117 0.5159 0.50429 0.49417 0.47779 0.47247 0.46337
## Proportion of Variance 0.00159 0.0015 0.00144 0.00138 0.00129 0.00126 0.00121
## Cumulative Proportion 0.97552 0.9770 0.97846 0.97984 0.98113 0.98239 0.98360
## PC99 PC100 PC101 PC102 PC103 PC104 PC105
## Standard deviation 0.44758 0.43428 0.42540 0.41019 0.40229 0.40193 0.38187
## Proportion of Variance 0.00113 0.00107 0.00102 0.00095 0.00091 0.00091 0.00082
## Cumulative Proportion 0.98473 0.98580 0.98682 0.98777 0.98869 0.98960 0.99042
## PC106 PC107 PC108 PC109 PC110 PC111 PC112
## Standard deviation 0.3767 0.35583 0.34609 0.34245 0.33205 0.32181 0.31356
## Proportion of Variance 0.0008 0.00072 0.00068 0.00066 0.00062 0.00059 0.00056
## Cumulative Proportion 0.9912 0.99194 0.99262 0.99328 0.99390 0.99449 0.99504
## PC113 PC114 PC115 PC116 PC117 PC118 PC119
## Standard deviation 0.30079 0.2988 0.28458 0.2669 0.25949 0.24660 0.23668
## Proportion of Variance 0.00051 0.0005 0.00046 0.0004 0.00038 0.00034 0.00032
## Cumulative Proportion 0.99555 0.9961 0.99652 0.9969 0.99730 0.99764 0.99796
## PC120 PC121 PC122 PC123 PC124 PC125 PC126
## Standard deviation 0.22643 0.22324 0.21772 0.19814 0.1873 0.16559 0.15200
## Proportion of Variance 0.00029 0.00028 0.00027 0.00022 0.0002 0.00015 0.00013
## Cumulative Proportion 0.99825 0.99853 0.99880 0.99902 0.9992 0.99937 0.99950
## PC127 PC128 PC129 PC130 PC131 PC132 PC133
## Standard deviation 0.14085 0.1347 0.1308 0.09985 0.08419 0.08024 0.06921
## Proportion of Variance 0.00011 0.0001 0.0001 0.00006 0.00004 0.00004 0.00003
## Cumulative Proportion 0.99962 0.9997 0.9998 0.99987 0.99991 0.99995 0.99997
## PC134 PC135 PC136 PC137 PC138 PC139
## Standard deviation 0.05072 0.03287 0.0245 0.01907 1.994e-15 1.116e-15
## Proportion of Variance 0.00001 0.00001 0.0000 0.00000 0.000e+00 0.000e+00
## Cumulative Proportion 0.99999 0.99999 1.0000 1.00000 1.000e+00 1.000e+00
## PC140 PC141 PC142 PC143 PC144
## Standard deviation 1.004e-15 1.001e-15 7.869e-16 7.77e-16 7.601e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.00e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.00e+00 1.000e+00
## PC145 PC146 PC147 PC148 PC149
## Standard deviation 6.118e-16 5.907e-16 5.039e-16 4.856e-16 4.186e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC150 PC151 PC152 PC153 PC154 PC155
## Standard deviation 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
## PC156 PC157 PC158 PC159 PC160 PC161
## Standard deviation 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
## PC162 PC163 PC164 PC165 PC166 PC167
## Standard deviation 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16 3.56e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00
## Cumulative Proportion 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.00e+00
## PC168 PC169 PC170 PC171 PC172 PC173
## Standard deviation 3.56e-16 3.56e-16 3.56e-16 2.45e-16 2.359e-16 2.232e-16
## Proportion of Variance 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.00e+00 1.00e+00 1.00e+00 1.00e+00 1.000e+00 1.000e+00
## PC174 PC175
## Standard deviation 2.072e-16 1.684e-16
## Proportion of Variance 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00
biplot(df_dummy.pca, scale = 0)#compute standard deviation of each principal component
std_dev <- df_dummy.pca$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]## [1] 20.155101 10.258863 7.705646 6.894418 5.319577 4.419484 4.082382
## [8] 3.941416 3.756023 3.581516
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:10]## [1] 0.11387063 0.05795968 0.04353473 0.03895151 0.03005411 0.02496884
## [7] 0.02306430 0.02226789 0.02122047 0.02023455
#scree plot
plot(prop_varex, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
type = "b")#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
type = "b")Assume you are modeling whether a patient attempted suicide (column AX). Please use support vector machine to model this. You might want to consider reducing the number of variables or somehow use extracted information from the variables. This can be a really fun modeling task! (80)